Skip to content

Conversation

@imback82
Copy link
Contributor

(Backport of #26593)

What changes were proposed in this pull request?

DataFrameNaFunctions.fill doesn't handle duplicate columns even when column names are not specified.

val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show

produces

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)

The reason for the above failure is that columns are looked up with DataSet.col() which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity.

This PR updates DataFrameNaFunctions.fill such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying fill to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).

Why are the changes needed?

If column names are not specified, fill should not fail due to ambiguity since it should still be able to apply fill to the eligible columns.

Does this PR introduce any user-facing change?

Yes, now the above example displays the following:

+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
|   1|hello|    2|
|   3|    4|hello|
+----+-----+-----+

How was this patch tested?

Added new unit tests.

@imback82
Copy link
Contributor Author

cc @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

Thank you so much, @imback82 .
cc @cloud-fan and @PavithraRamachandran

@dongjoon-hyun
Copy link
Member

Also, cc @gatorsmile

@imback82
Copy link
Contributor Author

Do you think we need to back port #26700, which fixes a very similar issue?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 30, 2020

I switched the type of SPARK-30065 to Bug. Yes, please proceed for that, too. Thanks!

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. (Pending Jenkins.)
I tested the changed part and the new test cases locally.
Thank you, @imback82 !

@SparkQA
Copy link

SparkQA commented Jan 31, 2020

Test build #117587 has finished for PR 27407 at commit 776a294.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Merged to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Jan 31, 2020
…cate columns

(Backport of #26593)

### What changes were proposed in this pull request?

`DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified.

```Scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show
```
produces
```
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col2: string (nullable = true)

org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
  at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)
```
The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity.

This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).

### Why are the changes needed?

If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns.

### Does this PR introduce any user-facing change?

Yes, now the above example displays the following:
```
+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
|   1|hello|    2|
|   3|    4|hello|
+----+-----+-----+

```

### How was this patch tested?

Added new unit tests.

Closes #27407 from imback82/backport-SPARK-29890.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants